Conceptual


Q1

# Simulate 100 observations from a linear model in two predictors
set.seed(1)
error = 0.7*rnorm(100)
x1 = rnorm(100, mean = -1.5, sd = 0.7)
x2 = rnorm(100, mean = 0.8, sd = 1.5)
beta0 = 1.6; beta1 = 0.8; beta2 = 1.7
y = beta0 + beta1*x1 + beta2*x2 + error
plot(x1, y, col = "darkgrey")

plot(x2, y, col = "darkgrey")

plot(x1, x2, col = "darkgrey")

library(tree)
sim.tree = tree(y~., data = data.frame(x1 = x1, x2 = x2, y = y))
summary(sim.tree)
## 
## Regression tree:
## tree(formula = y ~ ., data = data.frame(x1 = x1, x2 = x2, y = y))
## Number of terminal nodes:  7 
## Residual mean deviance:  0.738 = 68.63 / 93 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -2.371000 -0.466600 -0.001546  0.000000  0.490600  2.715000
plot(sim.tree)
text(sim.tree, pretty = 0)

sim.tree
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 100 748.300  1.8910  
##    2) x2 < 0.728544 47 187.000 -0.2784  
##      4) x2 < -1.21484 9  15.830 -3.6370 *
##      5) x2 > -1.21484 38  45.610  0.5170  
##       10) x2 < -0.227058 11   7.299 -0.6156 *
##       11) x2 > -0.227058 27  18.450  0.9785 *
##    3) x2 > 0.728544 53 144.100  3.8140  
##      6) x2 < 1.88623 29  20.170  2.6820  
##       12) x1 < -1.22483 21   6.062  2.2890 *
##       13) x1 > -1.22483 8   2.332  3.7150 *
##      7) x2 > 1.88623 24  41.840  5.1820  
##       14) x2 < 3.2566 18  11.950  4.6150 *
##       15) x2 > 3.2566 6   6.703  6.8840 *

Q2

Boosting with \(d=1\) is in fact an additive model. The boosted fit is \(\hat f(x)=\sum_{b=1}^B\lambda\hat{f}^b(x)\), and with \(d=1\) each tree \(\hat{f}^b(x)\) is a stump that splits on a single predictor, so it is a function of one variable and can be absorbed into the corresponding term \(f_j(x_j)\). Therefore \(\hat f(x)=\lambda\sum_{b=1}^B\hat{f}^b(x)\) is a (shrunken) additive model.
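To see this in R, here is a minimal sketch (not part of the original solution: it assumes the gbm package is installed and reuses the simulated x1, x2, y from Q1). Boosting depth-1 stumps means every tree involves a single predictor, so the ensemble is additive in x1 and x2:

library(gbm)
set.seed(1)
boost.sim = gbm(y ~ ., data = data.frame(x1 = x1, x2 = x2, y = y),
                distribution = "gaussian",
                n.trees = 1000,          # B
                interaction.depth = 1,   # d = 1: every tree is a single-split stump
                shrinkage = 0.01)        # lambda
summary(boost.sim)   # relative influence of x1 and x2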

Q3

Classification error: \(E=1-\max\limits_k\hat{p}_{mk}\)

Gini Index: \(G=\sum_{k=1}^K\hat{p}_{mk}(1-\hat{p}_{mk})\)

Cross-entropy: \(D=-\sum_{k=1}^K\hat{p}_{mk}\log{\hat{p}_{mk}}\)

pm1 = seq(0, 1, 0.1)
pm2 = 1 - pm1
class = 1 - pmax(pm1, pm2)                # classification error
gini = 2*pm1*(1 - pm1)                    # Gini index (K = 2)
entropy = -pm1*log(pm1) - pm2*log(pm2)    # cross-entropy; 0*log(0) gives NaN at the endpoints
entropy[is.nan(entropy)] = 0              # define 0*log(0) = 0 so the curve is drawn at 0 and 1
matplot(pm1, cbind(class, gini, entropy), type = 'l', col = 2:4,
        xlab = expression(hat(p)[m1]), ylab = "Impurity")
legend("topleft", c("Classification error", "Gini index", "Cross-entropy"), col = 2:4, lty = 1)

Q5

Majority vote: six of the ten estimates exceed 0.5, so the final classification is red.

Average probability: \(\frac{0.1+0.15+0.2+0.2+0.55+0.6+0.6+0.65+0.7+0.75}{10}=0.45<0.5\), so the final classification is green.
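A quick check in R, using the ten probability estimates from the question:

p = c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
sum(p > 0.5)   # 6 of the 10 trees vote red -> majority vote: red
mean(p)        # 0.45 < 0.5 -> average probability: green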

Q6

1. Build a large tree. We use recursive binary splitting: at each split, this approach selects the predictor and cutpoint such that splitting the predictor space into two regions leads to the greatest reduction in RSS.

That is, for any \(j\) and \(s\) we define the pair of half-planes \(R_1(j,s)=\{X\mid X_j<s\}\) and \(R_2(j,s)=\{X\mid X_j\ge s\}\), and we seek the values of \(j\) and \(s\) that minimize \(\sum\limits_{i:x_i\in R_1(j,s)}(y_i-\hat{y}_{R_1})^2+\sum\limits_{i:x_i\in R_2(j,s)}(y_i-\hat{y}_{R_2})^2\). This process is repeated until a stopping criterion is reached.

2. Prune the tree. Apply cost-complexity pruning (also known as weakest-link pruning) to the large tree in order to obtain a sequence of best subtrees as a function of \(\alpha\): for each value of \(\alpha\) there corresponds a subtree \(T\subset T_0\) such that \(\sum\limits_{m=1}^{|T|}\sum\limits_{i:x_i\in R_m}(y_i-\hat y_{R_m})^2+\alpha|T|\) is as small as possible.

3. Use K-fold cross-validation to choose \(\alpha\). For each fold \(k\), repeat steps (1) and (2) on all but the \(k\)-th fold of the training data and evaluate the mean squared prediction error on the held-out \(k\)-th fold as a function of \(\alpha\). Average the results over the folds and pick the \(\alpha\) that minimizes the average error.

4. Return the subtree for the chosen \(\alpha\). Using step (2), return the subtree corresponding to the best \(\alpha\) (see the R sketch below).
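A minimal sketch of these four steps in R (an illustration only, assuming the tree package and the Boston data from MASS):

library(tree)
library(MASS)
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
big.tree = tree(medv ~ ., data = Boston, subset = train)   # step 1: grow a large tree
cv.out = cv.tree(big.tree)                                  # step 3: cross-validation over subtree sizes (equivalently over alpha)
best.size = cv.out$size[which.min(cv.out$dev)]
pruned.tree = prune.tree(big.tree, best = best.size)        # steps 2 and 4: return the chosen subtree
plot(pruned.tree); text(pruned.tree, pretty = 0)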

Q7

library(MASS)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)*7/10)
y=Boston[-train,'medv']
p=ncol(Boston)-1          # number of predictors (medv is the response)
mt=c(p,p/2,sqrt(p))       # mtry values: m = p (bagging), m = p/2, m = sqrt(p)
ntr=seq(1,500,5)          # numbers of trees to evaluate
error=matrix(NA,length(mt),length(ntr))

# Fit a random forest for each (mtry, ntree) combination and record the test MSE
for(i in 1:length(mt)){
  for(j in 1:length(ntr)){
    random.boston=randomForest(medv~.,data=Boston,subset=train,mtry=mt[i],ntree=ntr[j])
    y_pred=predict(random.boston,newdata=Boston[-train,])
    error[i,j]=mean((y_pred-y)^2)
  }
}
matplot(ntr,t(error),type = "l",col = 2:4, xlab = "Number of Trees", ylab = "Test MSE")
legend("topright", expression("m = p(bagging)", "m = p/2", "m = "*sqrt(p)), col = 2:4, lty = 1)

Q8

(a)

library(ISLR)
train=sample(1:nrow(Carseats),nrow(Carseats)*7/10)
y.test=Carseats[-train,'Sales']

(b)

tree.car=tree(Sales~.,Carseats,subset=train)
plot(tree.car)
text(tree.car,pretty = 0)

yhat=predict(tree.car,newdata=Carseats[-train,])
mean((yhat-y.test)^2)
## [1] 5.073585

(c)

cv.car=cv.tree(tree.car)
plot(cv.car$size,cv.car$dev,type='b')

index=cv.car$size[which.min(cv.car$dev)]
prune.car = prune.tree(tree.car, best = index)
plot(prune.car)
text(prune.car, pretty = 0)

tree.pred = predict(prune.car, Carseats[-train,])
mean((tree.pred - y.test)^2)
## [1] 5.073585

As shown above, pruning the tree does not improve the test MSE here: cross-validation selects the full tree, so the pruned tree is identical to the unpruned one.

(d)

bag.car=randomForest(Sales~.,data=Carseats,subset=train,mtry=ncol(Carseats)-1,importance=TRUE)  # bagging: mtry = 10, the number of predictors
yhat.bag=predict(bag.car,Carseats[-train,])
mean((yhat.bag-y.test)^2)
## [1] 2.199577
importance(bag.car)
##                %IncMSE IncNodePurity
## CompPrice   30.2562568    222.336004
## Income       9.1186331    116.469126
## Advertising 28.2342288    213.595227
## Population  -1.2217989     65.261001
## Price       66.0425813    584.632317
## ShelveLoc   69.3397797    757.328449
## Age         15.8603024    181.969171
## Education    3.9624509     57.046358
## Urban       -0.4222321     12.181904
## US           0.1697393      6.849683

(e)

random.car=randomForest(Sales~.,data=Carseats,subset=train,mtry=sqrt(ncol(Carseats)-1),ntree = 500)
yhat.bag=predict(random.car,Carseats[-train,])
mean((yhat.bag-y.test)^2)
## [1] 2.876925
importance(random.car)   # node-purity importance for the random forest fit
varImpPlot(random.car)

p = ncol(Carseats) - 1   # number of predictors
errors = c()
# Test MSE as a function of mtry (m), with 50 trees per forest
for ( d in seq(1, p) ) {
  rf.carseats = randomForest(Sales~., data = Carseats, subset = train, mtry = d, ntree = 50)
  rf.pred = predict(rf.carseats, Carseats[-train,])
  errors[d] = mean((rf.pred - y.test)^2)
}
plot(errors, col = "red", type = "b", lwd = 2, xlab = "m", ylab = "Test MSE")